This notebook supports an asynchronous lecture meant to introduce you to some basics of spatial data. The lecture will be provided via Echo360.
I will provide you with a partially completed notebook to assist your work.
This notebooks will focus on “spatial data” and also accessing demographic data in R.
Key topics for today:
here package/data_raw/ directoryStandards:
library(knitr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # because we will probably see some dates
library(here) # more easily access files in your project
## here() starts at C:/Users/lucas/OneDrive/Desktop/bikeshare_2023
Some additional packages focuses on today’s work:
library(sf) # working with simple features - geospatial
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.2
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.3.2
A link to a book on tmap: https://r-tmap.github.io/
I will use the GeoJSON file. (Newer, not necessarily better, but … a single file. Not smaller, but … this one is not big.)
https://opendata.dc.gov/datasets/DCGIS::dc-health-planning-neighborhoods/about
Load the neighborhood geospatial data as neigh.
neigh = st_read(here("data_raw", "DC_Health_Planning_Neighborhoods.geojson")) %>% clean_names()
## Reading layer `DC_Health_Planning_Neighborhoods' from data source
## `C:\Users\lucas\OneDrive\Desktop\bikeshare_2023\data_raw\DC_Health_Planning_Neighborhoods.geojson'
## using driver `GeoJSON'
## Simple feature collection with 51 features and 8 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99556
## Geodetic CRS: WGS 84
Ensure it plots.
plot(neigh)
Download the DC covid datase for positive cases and store at an appropriate place in your project.
Read the data as df_c and be sure to clean names.
df_c = read_csv(here("data_raw", "DC_COVID-19_Total_Positive_Cases_by_Neighborhood.csv")) %>% clean_names()
## Rows: 26372 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): DATE_REPORTED, NEIGHBORHOOD
## dbl (2): OBJECTID, TOTAL_POSITIVES
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now - let’s focus on a particular date (for no reason other than simplifying our analysis).
df_cases = df_c %>%
filter(as_date(date_reported) == "2021-11-17") %>%
separate(neighborhood, into=c("code", "name"), sep = ":")
Create the dataframe df_cases:
df_cases=df_c %>%
filter(as_date(date_reported) == "2021-11-17") %>%
separate(neighborhood,into=c("code","name"),sep = ":") %>%
mutate(code=case_when(code=="N35" ~"N0",
TRUE ~ code)) %>%
select(-date_reported)
Join the dataframes and make a chloropleth map using tmap.
neigh2 = left_join(neigh, df_cases, by = c("code"))
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(neigh2) + tm_polygons("total_positives", alpha = 0.5)
Let’s get some data using tidycensus. Need an API key https://api.census.gov/data/key_signup.html
census_api_key("fd79001f8e491bb60620a3e5dfeb02656d0af22e")
## To install your API key for use in future sessions, run this function with `install = TRUE`.
What data is available — and what is the variable name?
(We are interested in the 5year American Community Survey.)
#what variables
v20 = load_variables(2018, "acs5")
#Variables include "Sex by Age" (B01001 variables), Race (B02001), and "People Recording Single Ancestry" (B04004), amongst others.
Get some data:
df_census=get_acs(geography = "tract",
variables=c("median_inc"="B06011_001",
"pop"="B01001_001",
"pop_black"="B02009_001"),
state="DC",geometry=TRUE,year=2021)
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
##
|
| | 0%
|
|========================= | 36%
|
|========================================================= | 81%
|
|======================================================================| 100%
Make a plot to verify that you read the data:
plot(df_census)
It’s in long format. Let’s make it wide.
df_cens=df_census %>%
select(-moe) %>%
pivot_wider(names_from = "variable",
values_from = "estimate")
tm_shape(df_cens) +tm_polygons("median_inc",alpha=.5)
Consider this problem:
tm_shape(neigh2) +tm_borders(col="blue",lwd=5,alpha=.2)+
tm_shape(df_cens) +tm_borders(col="red",lwd=1,alpha=.3)
OK - follow the challenging code elements:
You need to add a coordinate system to the census data:
df_cens_adj = df_cens %>% st_transform(4326)
But which way do we join — and — think about how it should “aggregate” the data.
df_j = st_join(df_cens_adj, neigh2, largest = TRUE)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
df_j_rev = st_join(neigh2, df_cens_adj, largest = TRUE)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
Since we want the geometry for the NEIGHBORHOODS, we need a different work a little harder:
df1=df_j %>% select(median_inc,pop,pop_black,code) %>%
group_by(code) %>%
summarise(pop_n=sum(pop),
pop_black_n=sum(pop_black),
adj_median_income=sum(pop*median_inc)/pop_n)
plot(df1)
Now that we are aggregating in the right way, we can join.
#df2=left_join(neigh2,df1)
df2=left_join(neigh2,df1 %>% st_set_geometry(NULL))
## Joining with `by = join_by(code)`
And visualize:
df2=df2 %>% mutate(black_perc=pop_black_n/pop_n, covid_rate=total_positives/pop_n)
tm_shape(df2)+tm_polygons(c("adj_median_income","covid_rate","black_perc"))
Improve that visualization:
df2 %>% filter(code!="N0") %>%
tm_shape() + tm_polygons(c("adj_median_income", "covid_rate", "black_perc"), alpha = .4)